Process Documents from Mail Store (Text Processing)
Synopsis
Generates word vectors from a text collection stored on an IMAP or POP3 mail server.
Input
- word list
The word list port.
- connection (Connection)
This port can take a connection of type Mail (retrieve).
Output
- example set (Data table)
The example set port.
- word list
The word list port.
- connection (Connection)
If a connection is supplied at the connection input port, it is passed through to this output port.
Parameters
- mail account The mail connection to use to retrieve the email. Only visible if the connection input port is not connected and the compatibility level is above 9.3.1.
- create word vector If checked, the tokens of a document are used to generate a vector that represents the document numerically.
- vector creation Select the schema for creating the word vector.
- add meta information If checked, available meta information about the text, such as filename and date, is added as attributes.
- keep text If checked, the input text is stored as a special String attribute with the role text.
- prune method Specifies whether too frequent or too infrequent words are ignored when building the word list, and how the frequency thresholds are specified.
- prune below percent Ignore words that appear in less than this percentage of all documents.
- prune above percent Ignore words that appear in more than this percentage of all documents.
- prune below absolute Ignore words that appear in fewer than this many documents.
- prune above absolute Ignore words that appear in more than this many documents.
- prune below rank Words are ordered by frequency, and words with a frequency lower than that of the rank given by this percentage are pruned.
- prune above rank Words are ordered by frequency, and words with a frequency higher than that of the rank given by this percentage are pruned.
- datamanagement Determines how the data is represented internally.
- define store The mail store connection can be defined either through a session bound to a JNDI name or explicitly by specifying host and user.
- jndi name JNDI name referencing a mail session.
- host Host name of the IMAP or POP3 server.
- user User name for the IMAP or POP3 account.
- password Password for the IMAP or POP3 account.
- connection properties Additional properties for the mail store.
- protocol The protocol to use, either IMAP or POP3.
- only unseen If checked, only new, unseen messages are processed.
- mark seen If checked, all processed messages are marked as read. Only works with IMAP, not with POP3.
- delete messages If checked, all processed messages are deleted. Especially useful for POP3.
- recursive If checked, subfolders are scanned recursively.
- folder Name of the IMAP folder to scan. Must be INBOX for POP3.
- download attachments If checked, attachments are downloaded together with the mails.
- attachment file-pattern A pattern for the attachments you want to select. The usual wildcards ? and * are supported.
- attachment MIME-type The MIME type you want to select. If this entry and all additional entries are empty, all MIME types are selected.
- parallelize vector creation Determines whether vector creation is executed in parallel.
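
The percentage-based pruning described under prune below percent and prune above percent can be illustrated with a short Python sketch. It is not the operator's implementation: the function name prune_by_percent, the sample documents, and the inclusive treatment of the two bounds are assumptions made for illustration only.

```python
from collections import Counter


def prune_by_percent(docs, below_percent, above_percent):
    """Return the sorted words whose document frequency lies within the bounds.

    docs is a list of token lists, one per document (hypothetical input format).
    """
    n_docs = len(docs)
    # Document frequency: each word counts at most once per document.
    doc_freq = Counter(word for tokens in docs for word in set(tokens))
    kept = []
    for word, count in doc_freq.items():
        percent = 100.0 * count / n_docs
        # Assumption: words exactly on a bound are kept.
        if below_percent <= percent <= above_percent:
            kept.append(word)
    return sorted(kept)


docs = [
    ["meeting", "agenda", "the"],
    ["the", "invoice", "attached"],
    ["the", "meeting", "moved"],
    ["the", "lunch", "menu"],
]
word_list = prune_by_percent(docs, below_percent=30.0, above_percent=90.0)
print(word_list)
```

In this sample, "the" appears in every document and is dropped by the upper bound, while words that occur in only one of the four documents fall under the lower bound.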
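
To show how the ? and * wildcards of attachment file-pattern behave, the following Python sketch filters a list of file names with the standard fnmatch module. The helper select_attachments and the file names are hypothetical, and case-sensitive matching is an assumption; the operator may match differently. Note that fnmatch also accepts bracket expressions such as [abc], which the parameter description does not mention.

```python
from fnmatch import fnmatchcase


def select_attachments(filenames, pattern):
    """Keep the file names that match a shell-style wildcard pattern.

    ? matches exactly one character, * matches any run of characters.
    """
    return [name for name in filenames if fnmatchcase(name, pattern)]


names = ["report.pdf", "notes.txt", "report_v2.pdf", "a.pdf"]
selected = select_attachments(names, "report*.pdf")
single = select_attachments(names, "?.pdf")
print(selected)
print(single)
```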
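
As an illustration of what create word vector produces, the sketch below turns the tokens of one document into a numeric vector aligned with a word list. It assumes a plain term-occurrence schema (raw counts); the schemas actually offered by vector creation are not listed here, and the function name term_occurrence_vector is hypothetical.

```python
from collections import Counter


def term_occurrence_vector(tokens, word_list):
    """Count how often each word of the word list occurs in the tokens."""
    counts = Counter(tokens)
    # One vector component per word list entry, in word list order.
    return [counts[word] for word in word_list]


word_list = ["agenda", "invoice", "meeting"]
vector = term_occurrence_vector(["meeting", "agenda", "meeting"], word_list)
print(vector)
```

Tokens that are not in the word list, for example because they were pruned, simply do not contribute to the vector.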